{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Qwen2-VL-2B-Instruct Lora 微调\n",
    "\n",
    "本节我们将简要介绍如何基于 `transformers` 和 `peft` 等框架，使用 Qwen2-VL-2B-Instruct 模型在 **COCO2014图像描述** 任务上进行 Lora 微调训练。Lora 是一种高效的微调方法，若需深入了解 Lora 的工作原理，可参考博客：[知乎|深入浅出 Lora](https://zhuanlan.zhihu.com/p/650197598)。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🌍 环境配置\n",
    "\n",
    "考虑到部分同学在配置环境时可能会遇到一些问题，我们在 AutoDL 平台上提供了预装了 Qwen2-VL 环境的镜像。点击下方链接并直接创建 Autodl 示例即可快速开始：[AutoDL-Qwen2-VL-self-llm](https://www.codewithgpu.com/i/datawhalechina/self-llm/Qwen2-VL-self-llm)。\n",
    "\n",
    "\n",
    "## 📚 准备数据集\n",
    "\n",
    "本节使用的是 [COCO 2014 Caption](https://modelscope.cn/datasets/modelscope/coco_2014_caption/summary) 数据集，该数据集主要用于多模态（Image-to-Text）任务。\n",
    "\n",
    "> 数据集介绍：COCO 2014 Caption数据集是Microsoft Common Objects in Context (COCO)数据集的一部分，主要用于图像描述任务。该数据集包含了大约40万张图像，每张图像都有至少1个人工生成的英文描述语句。这些描述语句旨在帮助计算机理解图像内容，并为图像自动生成描述提供训练数据。\n",
    "\n",
    "![05-2](./images/05-2.jpg)\n",
    "\n",
    "在本节的任务中，我们主要使用其中的前500张图像，并对它们进行处理和格式调整，目标是组合成如下格式的JSON文件：\n",
    "\n",
    "**数据集下载与处理方式**\n",
    "\n",
    "1. **我们需要做四件事情：**\n",
    "    - 通过Modelscope下载COCO 2014 Caption数据集\n",
    "    - 加载数据集，将图像保存到本地\n",
    "    - 将图像路径和描述文本转换为一个CSV文件\n",
    "    - 将CSV文件转换为JSON文件\n",
    "\n",
    "2. **使用下面的代码完成从数据下载到生成CSV的过程：**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 导入所需的库\n",
    "from modelscope.msdatasets import MsDataset\n",
    "import os\n",
    "import pandas as pd\n",
    "\n",
    "MAX_DATA_NUMBER = 500\n",
    "\n",
    "# 检查目录是否已存在\n",
    "if not os.path.exists('coco_2014_caption'):\n",
    "    # 从modelscope下载COCO 2014图像描述数据集\n",
    "    ds =  MsDataset.load('modelscope/coco_2014_caption', subset_name='coco_2014_caption', split='train')\n",
    "    print(len(ds))\n",
    "    # 设置处理的图片数量上限\n",
    "    total = min(MAX_DATA_NUMBER, len(ds))\n",
    "\n",
    "    # 创建保存图片的目录\n",
    "    os.makedirs('coco_2014_caption', exist_ok=True)\n",
    "\n",
    "    # 初始化存储图片路径和描述的列表\n",
    "    image_paths = []\n",
    "    captions = []\n",
    "\n",
    "    for i in range(total):\n",
    "        # 获取每个样本的信息\n",
    "        item = ds[i]\n",
    "        image_id = item['image_id']\n",
    "        caption = item['caption']\n",
    "        image = item['image']\n",
    "        \n",
    "        # 保存图片并记录路径\n",
    "        image_path = os.path.abspath(f'coco_2014_caption/{image_id}.jpg')\n",
    "        image.save(image_path)\n",
    "        \n",
    "        # 将路径和描述添加到列表中\n",
    "        image_paths.append(image_path)\n",
    "        captions.append(caption)\n",
    "        \n",
    "        # 每处理50张图片打印一次进度\n",
    "        if (i + 1) % 50 == 0:\n",
    "            print(f'Processing {i+1}/{total} images ({(i+1)/total*100:.1f}%)')\n",
    "\n",
    "    # 将图片路径和描述保存为CSV文件\n",
    "    df = pd.DataFrame({\n",
    "        'image_path': image_paths,\n",
    "        'caption': captions\n",
    "    })\n",
    "    \n",
    "    # 将数据保存为CSV文件\n",
    "    df.to_csv('./coco-2024-dataset.csv', index=False)\n",
    "    \n",
    "    print(f'数据处理完成，共处理了{total}张图片')\n",
    "\n",
    "else:\n",
    "    print('coco_2014_caption目录已存在,跳过数据处理步骤')\n"
   ]
  },
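  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before converting to JSON, it is worth a quick sanity check on the generated CSV. The snippet below is optional (and assumes the cell above completed successfully):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Optional sanity check: the CSV should have 500 rows (with the default\n",
    "# MAX_DATA_NUMBER) and two columns, image_path and caption\n",
    "df = pd.read_csv('./coco-2014-dataset.csv')\n",
    "print(df.shape)\n",
    "print(df.head())"
   ]
  },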
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. **在同一目录下，用以下代码，将csv文件转换为json文件：**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import json\n",
    "\n",
    "# 载入CSV文件\n",
    "df = pd.read_csv('./coco-2024-dataset.csv')\n",
    "conversations = []\n",
    "\n",
    "# 添加对话数据\n",
    "for i in range(len(df)):\n",
    "    conversations.append({\n",
    "        \"id\": f\"identity_{i+1}\",\n",
    "        \"conversations\": [\n",
    "            {\n",
    "                \"from\": \"user\",\n",
    "                \"value\": f\"COCO Yes: <|vision_start|>{df.iloc[i]['image_path']}<|vision_end|>\"\n",
    "            },\n",
    "            {\n",
    "                \"from\": \"assistant\", \n",
    "                \"value\": df.iloc[i]['caption']\n",
    "            }\n",
    "        ]\n",
    "    })\n",
    "\n",
    "# 保存为json\n",
    "with open('data_vl.json', 'w', encoding='utf-8') as f:\n",
    "    json.dump(conversations, f, ensure_ascii=False, indent=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "此时目录下会多出两个文件：\n",
    "- coco-2024-dataset.csv\n",
    "- data_vl.json\n",
    "\n",
    "至此，我们完成了数据集的准备。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🤖 模型下载与加载\n",
    "\n",
    "\n",
    "这里使用 `modelscope` 提供的 `snapshot_download` 函数进行下载，该方法对国内的用户十分友好。然后把它加载到Transformers中进行训练：\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from modelscope import snapshot_download, AutoTokenizer\n",
    "from transformers import TrainingArguments, Trainer, DataCollatorForSeq2Seq, Qwen2VLForConditionalGeneration, AutoProcessor\n",
    "import torch\n",
    "\n",
    "# 在modelscope上下载Qwen2-VL模型到本地目录下\n",
    "model_dir = snapshot_download(\"Qwen/Qwen2-VL-2B-Instruct\", cache_dir=\"./\", revision=\"master\")\n",
    "\n",
    "# 使用Transformers加载模型权重\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"./Qwen/Qwen2-VL-2B-Instruct/\", use_fast=False, trust_remote_code=True)\n",
    "# 特别的，Qwen2-VL-2B-Instruct模型需要使用Qwen2VLForConditionalGeneration来加载\n",
    "model = Qwen2VLForConditionalGeneration.from_pretrained(\"./Qwen/Qwen2-VL-2B-Instruct/\", device_map=\"auto\", torch_dtype=torch.bfloat16, trust_remote_code=True,)\n",
    "model.enable_input_require_grads()  # 开启梯度检查点时，要执行该方法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "模型大小约 4.5GB，下载模型大概需要 5 - 10 分钟。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "bat"
    }
   },
   "source": [
    "## 🚀 开始微调\n",
    "\n",
    "**本节代码做了以下几件事：**\n",
    "1. 下载并加载 `Qwen2-VL-2B-Instruct` 模型\n",
    "2. 加载数据集，取前496条数据参与训练，4条数据进行主观评测\n",
    "3. 配置Lora，参数为r=64, lora_alpha=16, lora_dropout=0.05\n",
    "4. 训练2个epoch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import torch\n",
    "from datasets import Dataset\n",
    "from modelscope import snapshot_download, AutoTokenizer\n",
    "from qwen_vl_utils import process_vision_info\n",
    "from peft import LoraConfig, TaskType, get_peft_model, PeftModel\n",
    "from transformers import (\n",
    "    TrainingArguments,\n",
    "    Trainer,\n",
    "    DataCollatorForSeq2Seq,\n",
    "    Qwen2VLForConditionalGeneration,\n",
    "    AutoProcessor,\n",
    ")\n",
    "import json\n",
    "\n",
    "def process_func(example):\n",
    "    \"\"\"\n",
    "    将数据集进行预处理\n",
    "    \"\"\"\n",
    "    MAX_LENGTH = 8192\n",
    "    input_ids, attention_mask, labels = [], [], []\n",
    "    conversation = example[\"conversations\"]\n",
    "    input_content = conversation[0][\"value\"]\n",
    "    output_content = conversation[1][\"value\"]\n",
    "    file_path = input_content.split(\"<|vision_start|>\")[1].split(\"<|vision_end|>\")[0]  # 获取图像路径\n",
    "    messages = [\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": [\n",
    "                {\n",
    "                    \"type\": \"image\",\n",
    "                    \"image\": f\"{file_path}\",\n",
    "                    \"resized_height\": 280,\n",
    "                    \"resized_width\": 280,\n",
    "                },\n",
    "                {\"type\": \"text\", \"text\": \"COCO Yes:\"},\n",
    "            ],\n",
    "        }\n",
    "    ]\n",
    "    text = processor.apply_chat_template(\n",
    "        messages, tokenize=False, add_generation_prompt=True\n",
    "    )  # 获取文本\n",
    "    image_inputs, video_inputs = process_vision_info(messages)  # 获取数据数据（预处理过）\n",
    "    inputs = processor(\n",
    "        text=[text],\n",
    "        images=image_inputs,\n",
    "        videos=video_inputs,\n",
    "        padding=True,\n",
    "        return_tensors=\"pt\",\n",
    "    )\n",
    "    inputs = {key: value.tolist() for key, value in inputs.items()} #tensor -> list,为了方便拼接\n",
    "    instruction = inputs\n",
    "\n",
    "    response = tokenizer(f\"{output_content}\", add_special_tokens=False)\n",
    "\n",
    "\n",
    "    input_ids = (\n",
    "            instruction[\"input_ids\"][0] + response[\"input_ids\"] + [tokenizer.pad_token_id]\n",
    "    )\n",
    "\n",
    "    attention_mask = instruction[\"attention_mask\"][0] + response[\"attention_mask\"] + [1]\n",
    "    labels = (\n",
    "            [-100] * len(instruction[\"input_ids\"][0])\n",
    "            + response[\"input_ids\"]\n",
    "            + [tokenizer.pad_token_id]\n",
    "    )\n",
    "    if len(input_ids) > MAX_LENGTH:  # 做一个截断\n",
    "        input_ids = input_ids[:MAX_LENGTH]\n",
    "        attention_mask = attention_mask[:MAX_LENGTH]\n",
    "        labels = labels[:MAX_LENGTH]\n",
    "\n",
    "    input_ids = torch.tensor(input_ids)\n",
    "    attention_mask = torch.tensor(attention_mask)\n",
    "    labels = torch.tensor(labels)\n",
    "    inputs['pixel_values'] = torch.tensor(inputs['pixel_values'])\n",
    "    inputs['image_grid_thw'] = torch.tensor(inputs['image_grid_thw']).squeeze(0)  #由（1,h,w)变换为（h,w）\n",
    "    return {\"input_ids\": input_ids, \"attention_mask\": attention_mask, \"labels\": labels,\n",
    "            \"pixel_values\": inputs['pixel_values'], \"image_grid_thw\": inputs['image_grid_thw']}\n",
    "\n",
    "def predict(messages, model):\n",
    "    # 准备推理\n",
    "    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
    "    image_inputs, video_inputs = process_vision_info(messages)\n",
    "    inputs = processor(\n",
    "        text=[text],\n",
    "        images=image_inputs,\n",
    "        videos=video_inputs,\n",
    "        padding=True,\n",
    "        return_tensors=\"pt\",\n",
    "    )\n",
    "    inputs = inputs.to(\"cuda\")\n",
    "\n",
    "    # 生成输出\n",
    "    generated_ids = model.generate(**inputs, max_new_tokens=128)\n",
    "    generated_ids_trimmed = [\n",
    "        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)\n",
    "    ]\n",
    "    output_text = processor.batch_decode(\n",
    "        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False\n",
    "    )\n",
    "    \n",
    "    return output_text[0]\n",
    "\n",
    "# 使用Transformers加载模型权重\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"./Qwen/Qwen2-VL-2B-Instruct/\", use_fast=False, trust_remote_code=True)\n",
    "processor = AutoProcessor.from_pretrained(\"./Qwen/Qwen2-VL-2B-Instruct\")\n",
    "\n",
    "model = Qwen2VLForConditionalGeneration.from_pretrained(\"./Qwen/Qwen2-VL-2B-Instruct/\", device_map=\"auto\", torch_dtype=torch.bfloat16, trust_remote_code=True,)\n",
    "model.enable_input_require_grads()  # 开启梯度检查点时，要执行该方法\n",
    "\n",
    "# 处理数据集：读取json文件\n",
    "# 拆分成训练集和测试集，保存为data_vl_train.json和data_vl_test.json\n",
    "train_json_path = \"data_vl.json\"\n",
    "with open(train_json_path, 'r') as f:\n",
    "    data = json.load(f)\n",
    "    train_data = data[:-4]\n",
    "    test_data = data[-4:]\n",
    "\n",
    "with open(\"data_vl_train.json\", \"w\") as f:\n",
    "    json.dump(train_data, f)\n",
    "\n",
    "with open(\"data_vl_test.json\", \"w\") as f:\n",
    "    json.dump(test_data, f)\n",
    "\n",
    "train_ds = Dataset.from_json(\"data_vl_train.json\")\n",
    "train_dataset = train_ds.map(process_func)\n",
    "\n",
    "# 配置LoRA\n",
    "config = LoraConfig(\n",
    "    task_type=TaskType.CAUSAL_LM,\n",
    "    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\", \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
    "    inference_mode=False,  # 训练模式\n",
    "    r=64,  # Lora 秩\n",
    "    lora_alpha=16,  # Lora alaph，具体作用参见 Lora 原理\n",
    "    lora_dropout=0.05,  # Dropout 比例\n",
    "    bias=\"none\",\n",
    ")\n",
    "\n",
    "# 获取LoRA模型\n",
    "peft_model = get_peft_model(model, config)\n",
    "\n",
    "# 配置训练参数\n",
    "args = TrainingArguments(\n",
    "    output_dir=\"./output/Qwen2-VL-2B\",\n",
    "    per_device_train_batch_size=2,\n",
    "    gradient_accumulation_steps=2,\n",
    "    logging_steps=10,\n",
    "    num_train_epochs=2,\n",
    "    save_steps=100,\n",
    "    learning_rate=1e-4,\n",
    "    save_on_each_node=True,\n",
    "    gradient_checkpointing=True,\n",
    "    report_to=\"none\",\n",
    ")\n",
    "        \n",
    "# 配置Trainer\n",
    "trainer = Trainer(\n",
    "    model=peft_model,\n",
    "    args=args,\n",
    "    train_dataset=train_dataset,\n",
    "    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),\n",
    ")\n",
    "# 开启模型训练\n",
    "trainer.train()\n",
    "\n",
    "# ===测试模式===\n",
    "# 配置测试参数\n",
    "val_config = LoraConfig(\n",
    "    task_type=TaskType.CAUSAL_LM,\n",
    "    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\", \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
    "    inference_mode=True,  # 训练模式\n",
    "    r=64,  # Lora 秩\n",
    "    lora_alpha=16,  # Lora alaph，具体作用参见 Lora 原理\n",
    "    lora_dropout=0.05,  # Dropout 比例\n",
    "    bias=\"none\",\n",
    ")\n",
    "\n",
    "# 获取测试模型\n",
    "val_peft_model = PeftModel.from_pretrained(model, model_id=\"./output/Qwen2-VL-2B/checkpoint-100\", config=val_config)\n",
    "\n",
    "# 读取测试数据\n",
    "with open(\"data_vl_test.json\", \"r\") as f:\n",
    "    test_dataset = json.load(f)\n",
    "\n",
    "test_image_list = []\n",
    "for item in test_dataset:\n",
    "    input_image_prompt = item[\"conversations\"][0][\"value\"]\n",
    "    # 去掉前后的<|vision_start|>和<|vision_end|>\n",
    "    origin_image_path = input_image_prompt.split(\"<|vision_start|>\")[1].split(\"<|vision_end|>\")[0]\n",
    "    \n",
    "    messages = [{\n",
    "        \"role\": \"user\", \n",
    "        \"content\": [\n",
    "            {\n",
    "                \"type\": \"image\", \n",
    "                \"image\": origin_image_path\n",
    "            },\n",
    "            {\n",
    "                \"type\": \"text\",\n",
    "                \"text\": \"COCO Yes:\"\n",
    "            }\n",
    "        ]}]\n",
    "    \n",
    "    response = predict(messages, val_peft_model)\n",
    "    messages.append({\"role\": \"assistant\", \"content\": f\"{response}\"})\n",
    "    print(messages[-1])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "看到下面的进度条即代表训练开始：\n",
    "\n",
    "![alt text](./images/04-1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🧐 推理LoRA微调后的模型\n",
    "\n",
    "加载LoRA微调后的模型，并进行推理。\n",
    "\n",
    "**完整代码如下：**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import Qwen2VLForConditionalGeneration, AutoProcessor\n",
    "from qwen_vl_utils import process_vision_info\n",
    "from peft import PeftModel, LoraConfig, TaskType\n",
    "\n",
    "config = LoraConfig(\n",
    "    task_type=TaskType.CAUSAL_LM,\n",
    "    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\", \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
    "    inference_mode=True,\n",
    "    r=64,  # Lora 秩\n",
    "    lora_alpha=16,  # Lora alaph，具体作用参见 Lora 原理\n",
    "    lora_dropout=0.05,  # Dropout 比例\n",
    "    bias=\"none\",\n",
    ")\n",
    "\n",
    "# default: Load the model on the available device(s)\n",
    "model = Qwen2VLForConditionalGeneration.from_pretrained(\n",
    "    \"./Qwen/Qwen2-VL-2B-Instruct\", torch_dtype=\"auto\", device_map=\"auto\"\n",
    ")\n",
    "model = PeftModel.from_pretrained(model, model_id=\"./output/Qwen2-VL-2B/checkpoint-100\", config=config)\n",
    "processor = AutoProcessor.from_pretrained(\"./Qwen/Qwen2-VL-2B-Instruct\")\n",
    "\n",
    "messages = [\n",
    "    {\n",
    "        \"role\": \"user\",\n",
    "        \"content\": [\n",
    "            {\n",
    "                \"type\": \"image\",\n",
    "                \"image\": \"测试图像路径\",\n",
    "            },\n",
    "            {\"type\": \"text\", \"text\": \"COCO Yes:\"},\n",
    "        ],\n",
    "    }\n",
    "]\n",
    "\n",
    "# Preparation for inference\n",
    "text = processor.apply_chat_template(\n",
    "    messages, tokenize=False, add_generation_prompt=True\n",
    ")\n",
    "image_inputs, video_inputs = process_vision_info(messages)\n",
    "inputs = processor(\n",
    "    text=[text],\n",
    "    images=image_inputs,\n",
    "    videos=video_inputs,\n",
    "    padding=True,\n",
    "    return_tensors=\"pt\",\n",
    ")\n",
    "inputs = inputs.to(\"cuda\")\n",
    "\n",
    "# Inference: Generation of the output\n",
    "generated_ids = model.generate(**inputs, max_new_tokens=128)\n",
    "generated_ids_trimmed = [\n",
    "    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)\n",
    "]\n",
    "output_text = processor.batch_decode(\n",
    "    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False\n",
    ")\n",
    "print(output_text)"
   ]
  },
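  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional last step, the LoRA weights can be merged back into the base model so that inference no longer depends on `peft`. The sketch below uses peft's standard `merge_and_unload` API; the output directory name is only an illustrative choice:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: fold the LoRA update into the base weights and save a standalone model.\n",
    "# Assumes `model` is the PeftModel built in the previous cell; the path is illustrative.\n",
    "merged_model = model.merge_and_unload()\n",
    "merged_model.save_pretrained(\"./output/Qwen2-VL-2B-merged\")\n",
    "processor.save_pretrained(\"./output/Qwen2-VL-2B-merged\")"
   ]
  }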
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
